Online Shoppers Purchasing Intention Prediction¶

Authors: Julian Daduica, Stephanie Ta, and Wai Ming Wong

In [1]:
from ucimlrepo import fetch_ucirepo # raw data is from this package
import pandas as pd
import pandera as pa
from deepchecks.tabular.checks import FeatureLabelCorrelation, FeatureFeatureCorrelation
from deepchecks.tabular import Dataset
import altair as alt
import altair_ally as aly
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform
import os

Summary¶

This study builds a classification model using logistic regression to predict whether an online shopper will make a purchase based on their website interaction behaviour. The final classifier achieved an accuracy of 87.6% on an unseen test dataset, compared with 83.5% for a dummy classifier that always predicts no purchase. While the logistic regression model performed reasonably well, it did not account for the class imbalance in the dataset, where the purchase class is much smaller than the no-purchase class. From our logistic regression model, we identified PageValues and ExitRates as the most important features when making predictions, suggesting that these features are the most significant for determining whether a customer will make a purchase. This can provide insight for businesses seeking to increase revenue by targeting and optimizing these features in marketing or sales campaigns. Further work addressing class imbalance and exploring alternative models or algorithms could improve predictions and make the model more useful to businesses.

Introduction¶

The growth of online shopping, or e-commerce, has completely changed how people shop. Online shopping provides the convenience of exploring many different online stores effortlessly from home, giving people more freedom over their time and choices. Retail e-commerce sales are estimated to exceed 4.1 trillion U.S. dollars worldwide in 2024, coming from roughly 2.7 billion online shoppers (Taheer, 2024; Commerce, 2024). In an ever-growing consumer society, it is important to understand consumers' behaviours as well as their intentions; this allows businesses to optimize the online shopping experience and maximize revenue in such a massive industry. When shopping in person, a store employee may find it easy to determine a person's purchasing intention through various social cues. Online, however, companies find it much more difficult to discern the intentions of their customers and must instead look for answers in data on user interactions such as page clicks, time spent on pages, and time of day or year. With ever-growing website traffic, businesses must distinguish between visitors with strong purchasing intentions and those who are simply browsing.

Machine learning is a powerful tool for analyzing and predicting online shoppers' purchasing intentions from behavioural and interaction data. Using machine learning techniques, we can analyze features such as bounce rates, visitor type, and time of year to identify patterns that help predict purchasing intention. In this study, we aim to use a machine learning algorithm to predict online shoppers' purchasing intentions, allowing us to extract meaningful insights from user data. In such a lucrative field, determining purchasing intentions is vital for increasing revenue: it can help companies find optimal sales and marketing techniques, or personalize each customer's experience on their website.

Methods¶

Data¶

The dataset used was sourced from the UCI Machine Learning Repository (Sakar et al., 2018) and can be found here. Each row in the dataset represents a web session on an e-commerce website, including details such as pages visited, time spent, and “Google Analytics” metrics for each page, such as “Bounce Rate”, “Exit Rate”, and “Page Value”. The “Special Day” variable highlights special events, while other web client attributes include OS, browser, region, traffic type, visitor type, and visit timing.

Specifically, our target in the dataset is whether the page visitor made a purchase or not (Revenue, true or false).

The features that are in the dataset are:

  • The number of account management pages the visitor visited (Administrative)
  • The amount of time in seconds that the visitor spent on account management pages (Administrative_Duration)
  • The number of informational pages the visitor visited (Informational)
  • The amount of time in seconds that the visitor spent on informational pages (Informational_Duration)
  • The number of product related pages the visitor visited (ProductRelated)
  • The amount of time in seconds that the visitor spent on product related pages (ProductRelated_Duration)
  • The average bounce rate of the pages the visitor visited (BounceRates)
  • The average exit rate value of the pages that the visitor visited (ExitRates)
  • The average page value of the pages that the visitor visited (PageValues)
  • How close the time of visiting was to a special day, such as Mother's Day (SpecialDay)
  • The operating system of the visitor (OperatingSystems)
  • The browser of the visitor (Browser)
  • The region from which the visitor started the session from (Region)
  • How the visitor entered the website, such as a banner, SMS, etc. (TrafficType)
  • The visitor type, such as "new visitor", "returning visitor", etc. (VisitorType)
  • If the visitor visited the website on a weekend (Weekend)
  • The month in which the visitor visited the website (Month)

Information about the target and features was sourced from Sakar et al.'s study (2018).

For data validation, we verified our data using the information provided above. There are no null values in any columns, and the data types are as expected. We also performed range checks using common sense, such as ensuring that the maximum amount of time in seconds within a day does not exceed 24 x 60 x 60 = 86,400.

While we identified some duplicated rows, we decided not to remove them. As mentioned before, the dataset represents web sessions on an e-commerce website from different users. It is plausible for observations to have identical values, as they likely represent similar simple browser client information and simple visitor actions, which can result in duplicate data being recorded within the same month.
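The duplicate check described above can be done with pandas' `duplicated` method. The frame below is a toy stand-in for the session data with illustrative values, not the real dataset:

```python
import pandas as pd

# Toy stand-in for the session data (values are illustrative only)
toy = pd.DataFrame({
    "BounceRates": [0.2, 0.2, 0.0],
    "ExitRates":   [0.2, 0.2, 0.1],
    "Month":       ["Feb", "Feb", "Feb"],
})

# Count rows that duplicate an earlier row across all columns;
# in our analysis such rows are kept, not dropped
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)  # 1
```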

When validating the data for feature-feature and feature-target correlations, we found three feature-feature correlations above the threshold of 0.8: Administrative with Administrative_Duration, Informational with Informational_Duration, and ProductRelated with ProductRelated_Duration. For this reason, we removed the Administrative, Informational, and ProductRelated columns from the dataset.

Analysis¶

We built a logistic regression classification model to predict whether a customer will make an online purchase on an e-commerce site based on their website visiting behaviour. All variables included in the dataset were used to fit the model. The data was split with 70% into the training set and 30% into the test set. The hyperparameter was chosen using 5-fold cross-validation with accuracy as the classification metric. Numeric variables were standardized and categorical variables were encoded prior to model fitting. The Python programming language (Van Rossum & Drake, 2009) and the following Python packages were used to perform the analysis: numpy (Harris et al., 2020), pandas (McKinney, 2010), altair (VanderPlas, 2018), and scikit-learn (Pedregosa et al., 2011). Data was fetched from the UCI Machine Learning Repository using the ucimlrepo package (Truong et al.). The code used to perform the analysis and create this report can be found here.

Results and Discussion¶

To investigate the features in our dataset, we first visualized the correlation between each pair of features using a heatmap. From this, we can see that most features are not strongly correlated with each other, apart from the count/duration pairs noted above, which we removed.

We also plotted the distribution of each numeric feature using density plots and the distribution of each categorical feature using bar plots, coloured by the target (false: blue, true: orange). For the numeric features, the target class distributions overlap and are of similar shape, but we decided to keep these features in our model since they may be useful for prediction in combination with other features. For the categorical features, the target class distributions also seem similar, but again we kept these features for the same reason. We also noticed an imbalance in our dataset: there are many more observations with target = false than with target = true. We did not account for this imbalance in our analysis (i.e. in the model and scoring metric) since that would be out of scope for this project, which relies only on DSCI 571 knowledge.
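The imbalance can be quantified directly from the target column with `value_counts`. The series below is hypothetical, sized to mirror the roughly 85/15 split we observed rather than taken from the real Revenue column:

```python
import pandas as pd

# Hypothetical target values mirroring the ~85/15 split (not the real Revenue column)
y = pd.Series([False] * 17 + [True] * 3)

# Proportion of each class; the majority class sets the accuracy
# a "always predict no purchase" baseline would achieve
class_balance = y.value_counts(normalize=True)
print(class_balance[False])  # 0.85
```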

In [2]:
# fetch the raw dataset from the UCI ML Repository
online_shoppers_purchasing_intention_dataset = fetch_ucirepo(id=468) 

# data (as pandas dataframes) and save it as csv
X = online_shoppers_purchasing_intention_dataset.data.features 
y = online_shoppers_purchasing_intention_dataset.data.targets
df = pd.concat([X, y], axis=1)
df.to_csv("../data/raw/raw_df.csv")
In [3]:
# validate that the raw data was saved as a csv file
assert os.path.isfile("../data/raw/raw_df.csv")
In [4]:
# validate data
schema = pa.DataFrameSchema(
    {"Administrative": pa.Column(int, nullable=False),
     "Administrative_Duration": pa.Column(float, pa.Check.between(0, 86400), nullable=False),
     "Informational": pa.Column(int, nullable=False),
     "Informational_Duration": pa.Column(float, pa.Check.between(0, 86400), nullable=False),
     "ProductRelated": pa.Column(int, nullable=False),
     "ProductRelated_Duration": pa.Column(float, pa.Check.between(0, 86400), nullable=False),
     "BounceRates": pa.Column(float, pa.Check.between(0, 1), nullable=False),
     "ExitRates": pa.Column(float, pa.Check.between(0, 1), nullable=False),
     "PageValues": pa.Column(float, nullable=False),
     "SpecialDay": pa.Column(float, pa.Check.between(0, 1), nullable=False),
     "Month": pa.Column(str, pa.Check.isin(["Jan", "Feb", "Mar", "Apr", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]), nullable=False),
     "OperatingSystems": pa.Column(int, nullable=False),
     "Browser": pa.Column(int, nullable=False),
     "Region": pa.Column(int, nullable=False),
     "TrafficType": pa.Column(int, nullable=False), 
     "VisitorType": pa.Column(str, pa.Check.isin(["New_Visitor", "Returning_Visitor", "Other"]), nullable=False),
     "Weekend": pa.Column(bool, nullable=False),
     "Revenue": pa.Column(bool, nullable=False),
    },
    checks=[
        pa.Check(lambda dfpa: ~(dfpa.isna().all(axis=1)).any(), error="Empty rows found.")
    ]
)

schema.validate(df, lazy=True)
Out[4]:
Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues SpecialDay Month OperatingSystems Browser Region TrafficType VisitorType Weekend Revenue
0 0 0.0 0 0.0 1 0.000000 0.200000 0.200000 0.000000 0.0 Feb 1 1 1 1 Returning_Visitor False False
1 0 0.0 0 0.0 2 64.000000 0.000000 0.100000 0.000000 0.0 Feb 2 2 1 2 Returning_Visitor False False
2 0 0.0 0 0.0 1 0.000000 0.200000 0.200000 0.000000 0.0 Feb 4 1 9 3 Returning_Visitor False False
3 0 0.0 0 0.0 2 2.666667 0.050000 0.140000 0.000000 0.0 Feb 3 2 2 4 Returning_Visitor False False
4 0 0.0 0 0.0 10 627.500000 0.020000 0.050000 0.000000 0.0 Feb 3 3 1 4 Returning_Visitor True False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12325 3 145.0 0 0.0 53 1783.791667 0.007143 0.029031 12.241717 0.0 Dec 4 6 1 1 Returning_Visitor True False
12326 0 0.0 0 0.0 5 465.750000 0.000000 0.021333 0.000000 0.0 Nov 3 2 1 8 Returning_Visitor True False
12327 0 0.0 0 0.0 6 184.250000 0.083333 0.086667 0.000000 0.0 Nov 3 2 1 13 Returning_Visitor True False
12328 4 75.0 0 0.0 15 346.000000 0.000000 0.021053 0.000000 0.0 Nov 2 2 3 11 Returning_Visitor False False
12329 0 0.0 0 0.0 3 21.250000 0.000000 0.066667 0.000000 0.0 Nov 3 2 1 2 New_Visitor True False

12330 rows × 18 columns

In [5]:
# split the training set and testing set and save them as csv files
train_df, test_df = train_test_split(df, test_size=0.3, random_state=123)
train_df.to_csv("../data/processed/train_df.csv")
test_df.to_csv("../data/processed/test_df.csv")
In [6]:
# validate that the training and testing sets were saved as csv files
assert os.path.isfile("../data/processed/train_df.csv")
assert os.path.isfile("../data/processed/test_df.csv")
In [7]:
# split X, y in the training set and testing set
X_train = train_df.drop(columns=["Revenue"])
X_test = test_df.drop(columns=["Revenue"])
y_train = train_df["Revenue"]
y_test = test_df["Revenue"]
In [8]:
# begin exploratory data analysis
train_df.describe()
Out[8]:
Administrative Administrative_Duration Informational Informational_Duration ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues SpecialDay OperatingSystems Browser Region TrafficType
count 8631.000000 8631.000000 8631.000000 8631.000000 8631.000000 8631.000000 8631.000000 8631.000000 8631.000000 8631.000000 8631.000000 8631.000000 8631.000000 8631.000000
mean 2.318851 80.035963 0.496582 33.735985 31.506546 1179.548652 0.022252 0.043180 5.765987 0.063330 2.129765 2.353261 3.150852 4.071371
std 3.326228 173.132521 1.244019 138.995400 44.119701 1895.590842 0.048634 0.048648 18.215382 0.202414 0.925164 1.727358 2.408261 4.011918
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 1.000000 1.000000
25% 0.000000 0.000000 0.000000 0.000000 7.000000 182.208333 0.000000 0.014286 0.000000 0.000000 2.000000 2.000000 1.000000 2.000000
50% 1.000000 7.000000 0.000000 0.000000 18.000000 593.701980 0.003077 0.025466 0.000000 0.000000 2.000000 2.000000 3.000000 2.000000
75% 4.000000 93.115833 0.000000 0.000000 37.000000 1439.177083 0.017124 0.050000 0.000000 0.000000 3.000000 2.000000 4.000000 4.000000
max 27.000000 3398.750000 16.000000 2549.375000 584.000000 63973.522230 0.200000 0.200000 361.763742 1.000000 8.000000 13.000000 9.000000 20.000000
In [9]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 8631 entries, 2476 to 3582
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           8631 non-null   int64  
 1   Administrative_Duration  8631 non-null   float64
 2   Informational            8631 non-null   int64  
 3   Informational_Duration   8631 non-null   float64
 4   ProductRelated           8631 non-null   int64  
 5   ProductRelated_Duration  8631 non-null   float64
 6   BounceRates              8631 non-null   float64
 7   ExitRates                8631 non-null   float64
 8   PageValues               8631 non-null   float64
 9   SpecialDay               8631 non-null   float64
 10  Month                    8631 non-null   object 
 11  OperatingSystems         8631 non-null   int64  
 12  Browser                  8631 non-null   int64  
 13  Region                   8631 non-null   int64  
 14  TrafficType              8631 non-null   int64  
 15  VisitorType              8631 non-null   object 
 16  Weekend                  8631 non-null   bool   
 17  Revenue                  8631 non-null   bool   
dtypes: bool(2), float64(7), int64(7), object(2)
memory usage: 1.1+ MB
In [10]:
aly.alt.data_transformers.enable('vegafusion')

feature_density_plot = aly.dist(train_df, color='Revenue')

feature_density_plot
Out[10]:

Figure 1. Distribution of numeric features for each target class.

In [11]:
# cast Revenue to object so it is included in the categorical plots
feature_bar_plot = aly.dist(train_df.assign(Revenue=lambda df: df['Revenue'].astype(object)), dtype='object', color='Revenue')

feature_bar_plot
Out[11]:

Figure 2. Distribution of categorical features for each target class.

In [12]:
corr_df = (
    train_df
    .corr('spearman', numeric_only=True)
    .abs()
    .stack()
    .reset_index(name='corr')
    .query('level_0 < level_1')  # keep each feature pair once
)
corr_df

correlation_heatmap = alt.Chart(corr_df).mark_rect().encode(
    x = alt.X('level_0:N', title = 'Feature 1'),
    y = alt.Y('level_1:N', title = 'Feature 2'),
    size = alt.Size('corr:Q', title = 'Correlation Strength'),
    color = alt.Color('corr:Q'),
    tooltip = ['level_0', 'level_1', 'corr']
).properties(
    width = 600,
    height = 600,
    title = "Correlation Heatmap Between all Features"
)

correlation_numbers = alt.Chart(corr_df).mark_text(baseline='middle').encode(
    x = alt.X('level_0:N', title='Feature 1'),
    y = alt.Y('level_1:N', title='Feature 2'),
    text=alt.Text('corr:Q', format='.2f')
)

correlation_heatmap + correlation_numbers
Out[12]:

Figure 3. Correlation plot between all features in dataset

In [13]:
# drop features with high feature - feature correlations
train_df = train_df.drop(columns = ["Administrative", 
                                    "Informational", 
                                    "ProductRelated"])

test_df = test_df.drop(columns = ["Administrative", 
                                  "Informational", 
                                  "ProductRelated"])
In [14]:
train_df_data_valid = Dataset(train_df, label="Revenue", cat_features=[])

# the maximum threshold allowed
threshold = 0.80

# validate training data for anomalous correlations between target variable and features
check_feature_target_corr = FeatureLabelCorrelation().add_condition_feature_pps_less_than(threshold)
check_feature_target_corr_result = check_feature_target_corr.run(dataset = train_df_data_valid)

if not check_feature_target_corr_result.passed_conditions():
    raise ValueError(f"Feature-target correlation exceeds the maximum acceptable threshold of {threshold}.")

# validate training data anomalous correlations between features
check_feature_feature_corr = FeatureFeatureCorrelation().add_condition_max_number_of_pairs_above_threshold(threshold = threshold)
check_feature_feature_corr_result = check_feature_feature_corr.run(dataset = train_df_data_valid)

if not check_feature_feature_corr_result.passed_conditions():
    raise ValueError(f"Feature-feature correlation exceeds the maximum acceptable threshold of {threshold}.")
In [15]:
# create baseline model to compare final model to
dummy_classifier = DummyClassifier()
dummy_classifier.fit(X_train, y_train)
dummy_cv_scores = pd.DataFrame(
    cross_validate(dummy_classifier, X_train, y_train, cv = 5, return_train_score = True))
mean_dummy_validation_accuracy = dummy_cv_scores['test_score'].mean()
mean_dummy_validation_accuracy
Out[15]:
0.8494960081213042

We chose a logistic regression model for our classifier. To find the model with the highest accuracy in predicting our target, we used 5-fold cross-validation to select the best value of the C hyperparameter. We found that the best C was 0.768.

In [16]:
# lists of each type of feature
numeric_cols = ['Administrative', 'Administrative_Duration',
                'Informational', 'Informational_Duration',
                'ProductRelated', 'ProductRelated_Duration',
                'BounceRates', 'ExitRates',
                'PageValues', 'SpecialDay']
categorical_cols = ['Weekend', 'OperatingSystems',
                    'Browser', 'Region',
                    'TrafficType', 'VisitorType']
ordinal_cols = ['Month']
In [17]:
# make preprocessor; note that imputation is not needed since there are no null values in the dataset
month_levels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

preprocessor = make_column_transformer(
    (StandardScaler(), numeric_cols),
    (OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categorical_cols),
    (OrdinalEncoder(categories=[month_levels]), ordinal_cols)
)
In [18]:
# make pipeline including preprocessor and logistic regression model
log_reg_pipe = make_pipeline(
    preprocessor, LogisticRegression(max_iter=2000, random_state=123)
)

From cross-validation, we found that our best logistic regression model with C = 0.768 yielded a validation accuracy score of 88.6%, which is slightly better (by 3.7 percentage points) than the validation accuracy score of our dummy classifier, which always predicts the majority class (84.9%).

In [19]:
# tune hyperparameter C of the logistic regression model
param_grid = {
    "logisticregression__C": loguniform(1e-3, 1e3),
}

random_search = RandomizedSearchCV(
    log_reg_pipe,
    param_grid,
    n_iter=100,
    verbose=1,
    n_jobs=-1,
    random_state=123,
    return_train_score=True,
)

random_search.fit(X_train, y_train)

print("Best hyperparameter value: ", random_search.best_params_)
print("Best score: %0.3f" % (random_search.best_score_))
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best hyperparameter value:  {'logisticregression__C': 0.7684071705306554}
Best score: 0.886

After evaluating our model on the test data, we found that our best logistic regression model with C = 0.768 had a test accuracy score of 87.6%, which is again somewhat better (by 4.1 percentage points) than the test accuracy score of our dummy classifier (83.5%).

In [20]:
# score dummy and best logistic regression model on test set
dummy_test_score = dummy_classifier.score(X_test, y_test)
log_reg_test_score = random_search.score(X_test, y_test)


print("Dummy classifier test score: %0.3f" % dummy_test_score)
print("Best logistic regression model test score: %0.3f" % log_reg_test_score)
Dummy classifier test score: 0.835
Best logistic regression model test score: 0.876

To find out how much weight our model places on each feature when predicting the target class, we examined the model's learned coefficients. We found that the ExitRates and PageValues features appear to be the most important for determining the target.

In [21]:
# find weights of each feature
best_estimator = random_search.best_estimator_
feature_names = best_estimator['columntransformer'].get_feature_names_out() # get feature names
weights = best_estimator["logisticregression"].coef_ # get feature coefficients

feat_weights = pd.DataFrame(weights, columns = feature_names)
feat_weights = feat_weights.T.reset_index()
feat_weights = feat_weights.rename(columns={'index': 'feature', 0: "weight"})
feat_weights['feature'] = feat_weights['feature'].str.split('__', expand = True)[1]
In [22]:
# average the weights of each overall feature

feat_weights['overall_feature'] = feat_weights['feature'].str.split('_', expand = True)[0]

# take absolute values so that positive and negative weights don't cancel each other out
feat_weights['absolute_value_weight'] = abs(feat_weights['weight'])

absolute_feat_weights = pd.DataFrame(feat_weights.groupby('overall_feature'
    ).mean(numeric_only=True
    ).sort_values('absolute_value_weight', ascending = False
    )['absolute_value_weight']).reset_index()


alt.Chart(absolute_feat_weights).mark_bar().encode(
    x = alt.X('absolute_value_weight').title('Absolute Value Weight'),
    y = alt.Y('overall_feature').sort('x').title('Overall Feature')
)
Out[22]:

Figure 4. Model features and their associated absolute value weight.

Our model achieves a high accuracy score of 87.6%, suggesting its potential usefulness in predicting whether a customer will purchase a product based on their behavioral and interaction data with a business's website. However, its performance is only marginally better than a model that always predicts a customer will not make a purchase (accuracy = 83.5%).

The analysis could be enhanced by addressing the imbalance in the target classes within our data and by using alternative scoring metrics. This approach might result in a better-performing model and a more robust evaluation. Additionally, exploring other classification models, such as SVM with RBF kernel and KNN, and comparing their performance to our logistic regression model could provide valuable insights.
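To illustrate why accuracy alone can mislead here, consider a minimal sketch with invented labels (not model output): on an imbalanced target, a classifier can score high accuracy while still missing half of the purchases, which recall would expose:

```python
# Invented labels for illustration: 8 non-purchases (0) and 2 purchases (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # misses one of the two purchases

# Tally the confusion-matrix cells for the positive (purchase) class
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)  # 0.9 1.0 0.5
```

Accuracy looks strong (90%) even though half of the actual purchases were missed (recall = 0.5), which is why metrics such as recall or F1 would give a more robust evaluation for this problem.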

We also identified that the features PageValues (the average value of web pages visited by the visitor) and ExitRates (the average exit rate of web pages visited) were the most significant for predictions. These findings offer meaningful insights into how businesses can anticipate customer intentions and develop strategies to increase sales, such as enhancing page value and reducing exit rate for potential customers. However, these insights may change if the model accounts for the class imbalance in the data.

References¶

Commerce, S. (2024, May 14). 43 ECommerce statistics in 2024 (global and U.s. data). SellersCommerce. https://www.sellerscommerce.com/blog/ecommerce-statistics/

Taheer, F. (2024, September 8). Online shopping statistics: How many people shop online in 2024? OptinMonster. https://optinmonster.com/online-shopping-statistics/

Sakar, C. O., Polat, S. O., Katircioglu, M., et al. (2018). Neural Computing and Applications. Dataset: Online Shoppers Purchasing Intention Dataset, UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset

Van Rossum, Guido, and Fred L. Drake. (2009). Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.

Harris, C.R. et al., (2020). Array programming with NumPy. Nature, 585, pp.357–362.

McKinney, Wes. (2010). “Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in Science Conference, edited by Stéfan van der Walt and Jarrod Millman, 51–56.

VanderPlas, J. et al., (2018). Altair: Interactive statistical visualizations for python. Journal of open source software, 3(32), p.1057.

Pedregosa, F. et al., (2011). Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), pp.2825–2830.

Truong, P., et al. ucimlrepo: Python package for fetching datasets from the UCI Machine Learning Repository. https://github.com/uci-ml-repo/ucimlrepo/tree/main